You can view, download or pull the Rmarkdown file for this content here https://github.com/michellebyrne1/MonashHonoursStatistics
We will be using a few packages for this content and also the baseline data only from the data collection exercise.
library(haven)
library(data.table)
library(JWileymisc)
library(ggplot2)
library(ggpubr)
library(visreg)
## read in data
db <- as.data.table(read_sav("[2021] PSY4210 BL.sav")) # baseline
Before we talk about generalized linear models, let’s review linear regression. A simple linear regression is based off the equation:
\[ y_i = b_0 + b_1 * x_i + \varepsilon_i \]
where:
The subscript \(i\) indicates that while the model parameters, \(b_0\) and \(b_1\), the regression coefficients, are the same for all participants. Each person (or observation) has its own value of \(y\) and its own value of \(x\) and because the model is not likely perfect, there will be some unexplained residual, \(\varepsilon_i\).
If we want to talk about only what is predicted based on the regression coefficients (the model parameters), we often write it:
\[ \hat{y_i} = b_0 + b_1 * x_i \]
leaving off the residual error term, \(\varepsilon_i\).
If you are looking at the HTML file, you may wonder where the R code to make the graphs has gone. It is still there in the Rmarkdown file, but as this part of the content is focused on conceptual understanding, not how to create graphs in R, I have hidden it by asking R not to echo [show] the source code in the final HTML file being created from the Rmarkdown file.
Visually, the intercept, \(b_0\), and slope, \(b_1\), coefficients/parameters look like the following figure.
Example image of a simple, linear regression model.
Changes to the intercept indicate that the height, sometimes called the level, of the line is shifted. Positive values of the intercept, that is \(b_0 > 0\), will shift the line up. For example as shown in the following figure. Negative fvalues of the intercept, that is \(b_0 < 0\), will shift the line down. Just because the intercept changes does not mean the slope has to change.
Example image of a simple, linear regression model with a different intercept value.
Changes to the slope indicate that the steepness of the line is shifted. Positive values indicate a more positive (steeper) slope, such that as \(x\) increases, \(y\) also is expected to increase. Negative values indicate a more negative (also maybe steeper declining) slope, such that as \(x\) increases, \(y\) is expected to decrease.
Example image of a simple, linear regression model with a different slope.
In simple linear regression, we are not creating an arbitrary line, we are creating the line of best fit. Here, “best fit” means the line that minimizes the sum of squared residuals.
To better understand residuals, let’s create a visual look at the residuals in a simple linear regression. To keep this example small, we will use a little dataset built into R, the mtcars dataset, which has data on just 32 cars. The following graph shows the scatter plot of two variables, the weight of cars (wt) and their horse power (hp). The solid black line is the regression line of best fit, and the blue lines between each observed value and the line are the residuals. The residuals capture the difference between what the model predicted and what was actually observed. That leads to another way we can think about and write the equation for residuals: the difference between the observed and predicted outcome values.
\[ \varepsilon_i = y_i - \hat{y_i} \]
When we fit a linear regression, equations are solved to give estimates of \(b_0\) and \(b_1\) that minimize the sum of squared residuals, and those parameter estimates (or regression coefficients) produce the line of best fit.
Multiple linear regression works in principle basically the same way as does simple linear regression. The difference is that more then one predictor (explanatory) variable is allowed in a single model. As before we have an outcome, \(y\) predicted by the combination of model parameters, the \(b\)s but instead of only one predictor, \(x\), we can have an indeterminate number, \(k\). It could be two predictors, three, twenty, the number does not change the underlying model.
\[ y_i = b_0 + b_1 * x_{1i} + ... + b_k * x_{ki} + \varepsilon_i \]
The regression coefficients (or more generally referred to as model parameters, where in this case the model is a regression), are interpretted fairly similarly to those in simple linear regression, but with some extra requirements.
Visually, it becomes more difficult to visualize, but remains possible with a 3D graph. The following figure shows the regression surface predicted by a multiple regression of neuroticism on both stress and selfesteem. Spend some time moving it around to view it from different angles and see the unique association between stress and neuroticism as well as the unique association between selfesteem and neuroticism. The word “unique” here is important as it conveys that this is not two simple linear regressions–in a multiple regression, the regression coefficients give the unique, or independent, association of each predictor with the outcome independent, or controlling for, the other predictors.
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
The linear regression models we saw fit more broadly into what are often called “linear models”, which is a class of models that are all fundamentally equivalent in that they are linear models and include t-tests, analysis of variance, pearson correlations, and linear regressions as specific types of linear models.
All these cases share in common that a linear association between predictors or explanatory variables and the outcome is assumed, and the outcome must be continuous and is assumed to be normally distributed.
The linear model is itself a special case of the generalized linear model or GLM. In particular, the GLM extends the linear model to different outcomes. A few examples are:
This is beneficial because it is fairly common in the real world to cross a variable that is not normally distributed and yet may be important to study. In this content, we are going to focus on the GLM for normal outcomes, i.e., linear regression. However, we will move towards a GLM framework as seeing and understanding linear regression through the GLM framework will allow us to use other types of GLMs for non-normal data more easily.
We already learned about linear regression
\[ y_i = b_0 + b_1 * x_{1i} + ... + b_k * x_{ki} + \varepsilon_i \]
First, we separate the model into pieces. For simplicity, I am going to drop the “i” subscript and use a boldface font to indicate when a variable is a vector that would have different observed values for each person. The parameters (e.g., \(b_0\)) are not bolded because they are not vectors at this point.
\[ \boldsymbol{\eta} = b_0 + b_1 * \boldsymbol{x_1} + ... + b_k * \boldsymbol{x_k} \]
\[ \boldsymbol{y} = \boldsymbol{\eta} + \boldsymbol{\varepsilon} \]
You have probably heard many times that in linear regression, we assume a normal distribution. Before we jump into the assumption for linear regression, we need to talk a bit more about probability distributions. A probability distribution is a function that maps a value to a particular probability of occurring. Most probability distributions we deal with, such as the normal distribution, is not a single distribution, but a family of distributions. This is because the specific normal distribution we get depends on some parameters.
In particular, the normal distribution has two parameters.
Formally, the normal distribution can be written:
\[ \mathcal{N}(\mu, \sigma) \]
When we talk about a normal distribution, we may think of the standard normal distribution, which has mean zero and unit variance/standard deviation:
\[ \mathcal{N}(0, 1) \]
However, there are many normal distributions. The following figure shows a few different normal distributions, some with the same mean and different standard deviations, some with different means. What this shows is that to figure out the probability, it is not enough to specify that you are assuming a normal distribution, you must also specify which specific normal distribution by specifying its parameters, the mean and standard deviation.
Getting back to linear regression, what does it really mean when we say that normality is assumed? You may have heard that the outcome has to be normally distributed, but that is not quite true. In fact, what we assume is that the outcome is conditionally normal, conditional on the model. In practice, this tends to written one of two ways, which are effectively equivalent.
\[ \boldsymbol{y} \sim \mathcal{N}(\boldsymbol{\eta}, \sigma_\varepsilon) \]
Which is read, “Y is distributed as a normal distribution with mean equal to \(\boldsymbol{\eta}\) and standard deviation \(\sigma\)”.
\[ \boldsymbol{\varepsilon} \sim \mathcal{N}(0, \sigma_\varepsilon) \]
The only difference is whether we use the residuals and assume a mean of 0 (because residuals are always constructed to have mean 0) or the raw outcome variable, but assume mean equal to \(\boldsymbol{\eta}\). In effect, we are assuming that the outcome is distributed normally around the regression line, with standard deviation based on the standard deviation of the residuals.
Any GLM has to be a linear model at some level, that is, this is always true:
\[ \boldsymbol{\eta} = b_0 + b_1 * \boldsymbol{x_1} + ... + b_k * \boldsymbol{x_k} \]
indeed, even t-tests and ANOVAs can be fit into this framework. The GLM generalizes linear regression by defining a function that links or transforms \(\boldsymbol{\eta}\) from a linear space to the outcome space.
The link function is always called \(g()\). Likewise an inverse link function exists that simply “undoes” whatever the link function did. This is called \(g()^{-1}\).
In normal linear regression, the link function and inverse link function are the identity function, by default, so there is no transformation. For other GLMs, the link and inverse link functions do some transformation.
In terms of formulae, we define the expected value of \(\boldsymbol{y}\), its mean, as \(\mu\)
Then
For normal linear regression because the link and inverse link functions are the identity function this works out to be
The powerful concept of the GLM is that by using link functions, the linear model
\[ \boldsymbol{\eta} = b_0 + b_1 * \boldsymbol{x_1} + ... + b_k * \boldsymbol{x_k} \]
always stays the same whether the outcome is continuous and normally distributed or binary and not normally distributed or count or any other type of variable and assumed distribution. This means in some sense, all GLMs are interpreted like linear regression, which facilitates modelling and interpretation.
Now, let’s take a look at linear regression in R and work through examples running and interpreting the results.
RLinear regression in R is generally conducted using the lm() function, which fits a linear model. It uses a formula interface to specify the desired model, with the format: outcome ~ predictor. We can store the results of the linear regression in an object, m, and then call various functions on this to get further information, for example summary() to get a quick summary or confint() to get confidence intervals.
When we used summary() we get quite a bit of output. The headings are:
R to produce the model, its a handy reminder of the variables and outcome especially if you are running and saving many models.stress coefficient. In multiple regression models, they will not be identical.Using confint() gives us 95% confidence intervals for the parameter estimates, the regression coefficients.
m <- lm(neuroticism ~ stress, data = db)
summary(m)
##
## Call:
## lm(formula = neuroticism ~ stress, data = db)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5594 -1.2344 0.0906 1.4406 4.2781
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.7907 0.6671 4.183 5.06e-05 ***
## stress 0.4188 0.0631 6.636 6.66e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.923 on 139 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.2406, Adjusted R-squared: 0.2351
## F-statistic: 44.03 on 1 and 139 DF, p-value: 6.659e-10
confint(m)
## 2.5 % 97.5 %
## (Intercept) 1.4717162 4.1097197
## stress 0.2939791 0.5435158
If you look at this in the Rmarkdown file, its quite complex as I am having R input the relevant numbers directly into the text. You don’t need to do this. I hate copying and pasting and want to make sure I don’t make errors that may confuse interpretation.
We could interpret the results of the regression as follows.
There was a stastistically significant association between stress and neuroticism. A one unit higher stress score was associated with a 0.42 [95% CI 0.29 - 0.54] higher neuroticism score, p < .001. Stress explained 23.5% of the variance in neuroticism.
We can also use the visreg package to help us visualize the results from the regression to better understand what it means. The defaults work fairly well in this case, giving us the regression line (in blue) and the shaded grey region shows the 95% confidence interval around the regression line. Partial residuals are plotted as well by default. We use the argument, gg = TRUE to ask visreg() to make it a ggplot2 graph which lets us do things like customize the theme and work with the usual ggplot2 graphing framework we are accustomed to now.
visreg(m, xvar = "stress", gg = TRUE) +
theme_pubr()
simple regression plot from visreg
If you want just the regression line, you can set parital = FALSE and rug = FALSE to stop plotting partial residuals. As with other graphs in ggplot2, we can use + to add additional elements, beyond the theme, like controlling the title.
visreg(m, xvar = "stress", partial = FALSE, rug = FALSE, gg = TRUE) +
theme_pubr() +
ggtitle("Linear regression of neuroticism on stress",
subtitle = "Shaded region shows 95% confidence interval.")
simple regression plot from visreg with customization
You can use the annotate() function to add annotations to the graph, such as a label indicating the coefficient and p-value. Figuring out the best position and size to be attractive is typically a process of some trial and error in any graph. Play around the with x and y coordinates and the size until you are happy with it.
visreg(m, xvar = "stress", partial = FALSE, rug = FALSE, gg = TRUE) +
theme_pubr() +
annotate("text", x = 15, y = 8, label = "b = 0.XX, p < .0001", size = 4) +
ggtitle("Linear regression of neuroticism on stress",
subtitle = "Shaded region shows 95% confidence interval.")
simple regression plot from visreg with even more customization
# Can you fill in the b coefficient in the label from the summary of m above?
You might find this page helpful for further explanation of dummy coding: https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faqwhat-is-dummy-coding/
In addition to continuous variables, categorical variables can be included as predictors / explanatory variables in linear regression. The most common approach to including a categorical predictor is to use dummy coding. The basic idea behind dummy coding is to turn a categorical predictor with \(k\) levels into \(k\) separate dummy code variables, where each dummy coded variable is coded as 1 for a particular level of the variable and 0 otherwise. The simplest case of this is a binary variable, a categorical variable with only two levels (e.g., male and female participants) where you might create a new dummy coded variable where 1 = female and 0 = male participants. This dummy coded variable is now a numeric predictor (it has 0s and 1s) and can be included in regression as usual.
The only real difference between interpreting a dummy coded variable from a regular continuous predictor is that: (1) the 0 point is not arbitrary, it represents one particular group (e.g., male participants) and (2) a one unit change in the dummy coded variable is not arbitrary, it represents the difference between groups (because 0 to 1 is a one unit change).
R automatically dummy codes any factor variable you enter into a linear regression as a predictor, so you do not have to do it manually. If the variable is not a factor, you may want to convert it to a factor first using factor().
Let’s take a look at a simple linear regression with a categorical predictor, sex. We will store these results in m2, for model 2. We can also graph the results.
db$sex <- db$sex - 1 # recode to 1s and 0s
db$sex <- factor(db$sex)
m2 <- lm(neuroticism ~ sex, data = db)
summary(m2)
##
## Call:
## lm(formula = neuroticism ~ sex, data = db)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4622 -1.4622 0.5378 1.5378 3.9545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.0455 0.4312 11.701 < 2e-16 ***
## sex1 2.4167 0.4693 5.149 8.77e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.022 on 139 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.1602, Adjusted R-squared: 0.1541
## F-statistic: 26.51 on 1 and 139 DF, p-value: 8.772e-07
visreg(m2, xvar = "sex", partial = FALSE, rug = FALSE, gg = TRUE) +
theme_pubr() +
ggtitle("A categorical predictor")
simple regression plot from visreg with categorical predictor
MLB: If you wish to interpret our results here, please read this: https://www.bbc.com/future/article/20161011-do-men-and-women-really-have-different-personalities
In this simple example, since there are no other predictors, the intercept, the expected value of neuroticism when the predictors are 0 is the mean neuroticism level in the male participants, and the regression coefficient for sex is the difference in neuroticism on average between male and female participants.
Incidentally, the results from this linear regression with dummy coding will exactly match what we would get from a t-test (compare the mean in the male group with the intercept, and the mean difference as well as the p-value for sex).
t.test(neuroticism ~ sex, data = db, var.equal = TRUE)
##
## Two Sample t-test
##
## data: neuroticism by sex
## t = -5.1491, df = 139, p-value = 8.772e-07
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -3.344716 -1.488744
## sample estimates:
## mean in group 0 mean in group 1
## 5.045455 7.462185
It is also the same as we would get from an ANOVA (compare the F-values and the p-values).
summary(aov(neuroticism ~ sex, data = db))
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 108.4 108.44 26.51 8.77e-07 ***
## Residuals 139 568.5 4.09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness
RLinear regressions have a few assumptions, they assume that:
The normality assumption is typically tested by comparing the residuals from the model against a normal distribution. The independence assumption cannot readily be tested, but we normally assume it is true if participants are independent of each other (e.g., are not siblings, we don’t have repeated measures on the same person, etc.). The homogeneity of variance assumption is typically assessed by plotting the residuals against predicted values. The expectation is that across the range of predicted values, the spread (variance) of the residuals is about equal and that there should not be any systematic trends in the residuals.
Although not an assumption, we often want to identify outliers or extreme values that may unduly influence our findings. This is often done by examining the residuals from the model.
Most of these diagnostics and assumptions can be checked graphically. We use the modelDiagnostics() function to calculate diagnostics and the plot() function to create a visual display. This should be very familiar from the Data Visualization topic where we used testDistribution() and plot to test a specific variable. It is a similar idea but now operating on the residuals.
m <- lm(neuroticism ~ stress, data = db)
## assess model diagnostics
md <- modelDiagnostics(m, ev.perc = .005)
## Warning in .local(x, ...): singularity problem
## Warning in rq.fit.sfn(x, y, tau = tau, rhs = rhs, control = control, ...): tiny diagonals replaced with Inf when calling blkfct
## Warning in .local(x, ...): singularity problem
## Warning in rq.fit.sfn(x, y, tau = tau, rhs = rhs, control = control, ...): tiny diagonals replaced with Inf when calling blkfct
plot(md, ncol = 2, ask = FALSE)
Regression model diagnostics. Left panel shows a plot of the residuals to assess normality and they will be highlighted black if an extreme value. The right panel shows a plot of the predicted versus residual values to help assess the homogeneity of variance assumption.
The left panel shows a density plot of the residuals (solid black line) and for reference the density of what a perfectly normal distribution would look like with the same mean and residual variance (dashed blue line). The rug plot under the density plot (small, vertical lines) shows where raw data points fall. The x axis labels (for the Residuals) shows the minimum, 25th percentile, median, 75th percentile, and maximum, providing a quick quantitative summary. The bottom of the left panel shows a QQ deviates plot, this is the same as a regular QQ plot but it has been rotated so that the line is horizontal instead of diagonal to take up less space. If the points fall exactly on the line, that indicates they fall exactly where they would be expected under a normal distribution. Both the rug plot and the dots plotted in the QQ deviates plot are light grey if they are not extreme and are colored solid black if they meet the criteria for an extreme value, as specified in the call to modelDiagnostics().
The right panel shows a scatter plot of the model predicted values against the residuals. This can help asses the homogeneity of variance assumption. If the residual variance is homogeneous, we would expect the spread of the residuals (the y axis) to be about the same at all levels of the predicted values. Sometimes this is very easy to see visually. Sometimes it is not so easy to see. To aid in the visual interpretation, quantile regression lines are added. Quantile regression lines allow a regression model to predict the 10th and 90th percentile. These predicts are added as dashed blue lines. Under the assumption of homogeneous variance, one would expect these lines to be approximately horizontal and parallel to each other. If they are not parallel, that indicates that the variance is larger / smaller for some predicted values than for others, known as heterogeneous variance. The solid blue line in the middle is a loess smooth line, which also would ideally be about flat and about zero, to indicate that there is no systematic bias in the residuals. Systematic bias can occur, for example when for low predicted values, the residuals are always positive and for high predicted values the residuals are always negative and indicate, typically, either ceiling / floor effects in the data and/or issues that would ideally be addressed by revising the model. As with the density plot, the axes in the scatter plot show the minimum, 25th percentile, median, 75th percentile, and maximum, providing a nice quantitative summary of the residuals and predicted values.
These graph looks fairly good although the loess line is not great. The diagnostics are not perfect, but real data never are and the degree of variation in the spread of residuals is not extreme (but not good). We might be reasonably satisfied that the assumption of normality and homogeneity of variance are met or at least not terribly violated, although a case could be made that the homogeneity assumption is violated. There also do not appear to be extreme values in the residuals.
For the sake of example, let’s look at another outcome measure. selfesteem. Again we will use stress as a predictor.
malt <- lm(selfesteem ~ stress, data = db)
## assess model diagnostics
maltd <- modelDiagnostics(malt, ev.perc = .005)
## Warning in .local(x, ...): singularity problem
## Warning in rq.fit.sfn(x, y, tau = tau, rhs = rhs, control = control, ...): tiny diagonals replaced with Inf when calling blkfct
## Warning in .local(x, ...): singularity problem
## Warning in rq.fit.sfn(x, y, tau = tau, rhs = rhs, control = control, ...): tiny diagonals replaced with Inf when calling blkfct
plot(maltd, ncol = 2, ask = FALSE)
Regression model diagnostics. Left panel shows a plot of the residuals to assess normality and they will be highlighted black if an extreme value. The right panel shows a plot of the predicted versus residual values to help assess the homogeneity of variance assumption.
These graphs show there are three extreme values, based on the current criteria. The homogeneity of variance plot does not appear as bad, but the extreme values on the residuals are apparent. At a minimum, we would want to remove these extreme values. We could also try a log transformation of the outcome (although you wouldn’t do this if the normality actually looks ok - but let’s just do it as an example anyway).
malt2 <- lm(log(selfesteem) ~ stress, data = db)
## assess model diagnostics
malt2d <- modelDiagnostics(malt2, ev.perc = .005)
## Warning in .local(x, ...): singularity problem
## Warning in rq.fit.sfn(x, y, tau = tau, rhs = rhs, control = control, ...): tiny diagonals replaced with Inf when calling blkfct
## Warning in .local(x, ...): singularity problem
## Warning in rq.fit.sfn(x, y, tau = tau, rhs = rhs, control = control, ...): tiny diagonals replaced with Inf when calling blkfct
plot(malt2d, ncol = 2, ask = FALSE)
Regression model diagnostics after log transforming the outcome.
Even after the log transformation, the extreme values remain. Let’s remove those. Extreme values are stored and accessible within the diagnostics object output, malt2d. It shows the scores on the outcome, the index (the row in the data the extreme values come from) and the effect type where they are an extreme value, in this case, extreme on the residuals. We use the Index values to remove them from the dataset and then refit our model. We can use - in front indices to tell data table that instead of selecting those rows, we want to exclude those rows.
malt2d$extremeValues
## log(selfesteem) Index EffectType
## 1: 1.609438 24 Residuals
## 2: 2.079442 89 Residuals
malt3 <- lm(log(selfesteem) ~ stress,
data = db[-malt2d$extremeValues$Index])
## assess model diagnostics
malt3d <- modelDiagnostics(malt3, ev.perc = .005)
## Warning in rq.fit.sfn(x, y, tau = tau, rhs = rhs, control = control, ...): tiny diagonals replaced with Inf when calling blkfct
## Warning in rq.fit.sfn(x, y, tau = tau, rhs = rhs, control = control, ...): tiny diagonals replaced with Inf when calling blkfct
plot(malt3d, ncol = 2, ask = FALSE)
Regression model diagnostics after log transforming the outcome and excluding extreme values.
If you would like some more examples or additional explanation, have a look at: http://joshuawiley.com/JWileymisc/articles/diagnostics-vignette.html
After excluding the initial extreme values, one new extreme value emerges. However, it is relatively less extreme and overall the model diagnostics look much better now, although the homogeneity assumption could be argued to still be violated.
RYou can conduct multiple linear regressions in R almost the same way as you would a simple linear regression. Additional predictors can be added by simply using + between them. The following code shows an example adding both a continuous and categorical predictor simultaneously. Next, we run diagnostics, and only after diagnostics take a look at the summary. We leave the summary() until after diagnostics on purpose as there is not much point interpreting the coefficients if the model assumptions are badly violated. We would want to first fix the model and then interpret.
m3 <- lm(neuroticism ~ stress + sex, data = db)
m3d <- modelDiagnostics(m3, ev.perc = .005)
## Warning in .local(x, ...): singularity problem
## Warning in rq.fit.sfn(x, y, tau = tau, rhs = rhs, control = control, ...): tiny diagonals replaced with Inf when calling blkfct
## Warning in .local(x, ...): singularity problem
## Warning in rq.fit.sfn(x, y, tau = tau, rhs = rhs, control = control, ...): tiny diagonals replaced with Inf when calling blkfct
plot(m3d, ncol = 2, ask = FALSE)
Multiple regression model diagnostics.
summary(m3)
##
## Call:
## lm(formula = neuroticism ~ stress + sex, data = db)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3214 -1.3214 0.0681 1.2890 3.8471
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.27425 0.67666 1.883 0.0618 .
## stress 0.38951 0.05812 6.702 4.81e-10 ***
## sex1 2.15205 0.41104 5.236 5.99e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.763 on 138 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.3664, Adjusted R-squared: 0.3572
## F-statistic: 39.91 on 2 and 138 DF, p-value: 2.108e-14
We can also get a visual summary of the model using visreg(). We separate the plot by sex since it is a categorical predictor it is easy to make separate lines for each sex group. So that the lines are plotted in the same graph and not separate graphs, we specify overlay = TRUE.
visreg(m3, xvar = "stress", by = "sex", partial = FALSE, rug = FALSE,
overlay = TRUE, gg = TRUE) +
theme_pubr() +
ggtitle("A multiple regression - separate lines for each sex group")
multiple regression plot from visreg
We could interpret the results of the regression as follows.
There was a stastistically significant association between stress and neuroticism, controlling for sex group. A one unit higher stress score was associated with a 0.39 [95% CI 0.27 - 0.50] higher neuroticism score, p < .001.
There was a stastistically significant association between sex group and neuroticism, controlling for stress. Compared to male participants, female participants had 2.15 [95% CI 1.34 - 2.96] higher neuroticism scores, p < .01. Overall, stress and sex group explained 35.7% of the variance in neuroticism.
When running a simple linear regression, the overall model variance explained, \(R^2\), serves as an effect size estimate. However, in multiple regression, that only serves as an effect size for the overall model. In this case, even though we can maybe tell that more of the effect is attributable to stress, we do not have a specific quantitative effect size estimate for each predictor.
We can get a nice set of tests for a model using the modelTest() function and have a nicely formatted set of output using the APAStyler() function. We wrap it in knitr::kable() to get nice output in our HTML file.
m3test <- modelTest(m3)
knitr::kable(APAStyler(m3test))
| Term | Est | Type |
|---|---|---|
| (Intercept) | 1.27 [-0.06, 2.61] | Fixed Effects |
| stress | 0.39*** [ 0.27, 0.50] | Fixed Effects |
| sex1 | 2.15*** [ 1.34, 2.96] | Fixed Effects |
| N (Observations) | 141 | Overall Model |
| logLik DF | 4 | Overall Model |
| logLik | -278.50 | Overall Model |
| AIC | 565.00 | Overall Model |
| BIC | 576.80 | Overall Model |
| F2 | 0.58 | Overall Model |
| R2 | 0.37 | Overall Model |
| Adj R2 | 0.36 | Overall Model |
| stress | f2 = 0.33, p < .001 | Effect Sizes |
| sex | f2 = 0.20, p < .001 | Effect Sizes |
The table has several parts. The “Term” column tells us what we are looking at, the “Est” column gives us the values, and the “Type” column is the broad category. First up, we have the fixed effects, they are called fixed effects because they do not vary across individuals. These are our model parameters, the regression coefficients. What is reported is the estimates, asterisks to indicate p-values, and the 95% confidence intervals in brackets. Rounding to two decimal points is the default consistent with standard APA style.
According to Cohen’s (1988) guidelines, f2 >= 0.02, f2 >= 0.15, and f2 >= 0.35 represent small, medium, and large effect sizes, respectively.
The next section presents information about the overall model, like how many observations were included, the log likelihood (logLik), the degrees of freedom for the overall log likelihood, information criterion (AIC, BIC), which we will learn about later for model comparisons, and several overall model effect sizes: cohen’s \(f^2\) effect size, the \(R^2\) and adjusted \(R^2\). Cohen’s \(f^2\) effect size for the overall model is defined as
\[ \frac{R^{2}}{1 - R^{2}} \]
it is based on the model \(R^2\) but divided by the variance not explained, \(1 - R^{2}\).
Finally at the bottom, are cohen’s \(f^2\) effect sizes for each individual predictor along with p-values for each predictor overall. The Cohen’s \(f^2\) effect size for individual coefficients is defined as:
\[ \frac{R_{AB}^{2} - R_{A}^{2}}{1 - R_{AB}^{2}} \]
If you would like some more examples or additional explanation, have a look at: http://joshuawiley.com/JWileymisc/articles/model-test-vignette.html
where what is compared is the the variance explained by the overall model minus the variance explained by a model excluding that specific predictor. In essence, this provides an estimate of the unique variance in the outcome explained by each predictor, relative to the total variance not explained. Cohen advocated this approach because, for example, if a model already explains 90% of the variance in an outcome, it is very difficult to explain an additional 9% (that would bring the model to nearly 100% variance explained). In contrast, if a model explains 0% of the variance, adding one more predictor may have an easier time explaining 9% of the variance. Thus Cohen’s \(f^2\) effect size accounts for this. The square root of this often is used in power analyses for regression as well (e.g., in G*Power).
RSometimes the association between two variables depends on a third variable. This is called moderation. The way that we formally test for moderation in regression is by adding an interaction term. An interaction term is the product of two (or more) variables. In an equation:
\[ \boldsymbol{y} = b_0 + b_1 * \boldsymbol{x} + b_2 * \boldsymbol{w} + b_3 * \boldsymbol{x} * \boldsymbol{w} + \boldsymbol{\varepsilon} \]
The interaction term gets an additional regression coefficient (another model parameter), \(b_3\) in this example, that captures how different the association of each individual variable (\(b_1\) or \(b_2\)) with the outcome is depending on the level of the other variable.
As we are focused on moderation here, we are not evaluating model diagnostics. In research or applied settings, you would definitely want to conduct diagnostics before being confident the model results are trustworthy.
It is very easy to include interactions in R. Inside of a regression formula, we use the * operator, which expands to mean, give me the main effects and the interactions between these variables. Here is an example evaluating an interaction between stress and sex on neuroticism. Note that when there are interactions, modelTest() does not deal well with categorical predictors that are automatically dummy coded. The solution is to manually dummy code. The following code estimates the model and makes a nice summary with effect sizes.
db[, SexDummy := ifelse(sex == "1", 1L, 0L)]
mint <- lm(neuroticism ~ stress * SexDummy, data = db)
knitr::kable(APAStyler(modelTest(mint)))
| Term | Est | Type |
|---|---|---|
| (Intercept) | 0.79 [-2.08, 3.67] | Fixed Effects |
| stress | 0.44** [ 0.15, 0.73] | Fixed Effects |
| SexDummy | 2.73 [-0.44, 5.90] | Fixed Effects |
| stress:SexDummy | -0.06 [-0.37, 0.25] | Fixed Effects |
| N (Observations) | 141 | Overall Model |
| logLik DF | 5 | Overall Model |
| logLik | -278.43 | Overall Model |
| AIC | 566.86 | Overall Model |
| BIC | 581.60 | Overall Model |
| F2 | 0.58 | Overall Model |
| R2 | 0.37 | Overall Model |
| Adj R2 | 0.35 | Overall Model |
| stress | f2 = 0.07, p = .003 | Effect Sizes |
| SexDummy | f2 = 0.02, p = .091 | Effect Sizes |
| stress:SexDummy | f2 = 0.00, p = .710 | Effect Sizes |
We can also use visreg() to make a nice interaction plot.
visreg(mint, xvar = "stress", by = "SexDummy", partial = FALSE, rug = FALSE,
overlay = TRUE, gg = TRUE) +
theme_pubr() +
ggtitle("A moderated multiple regression")
moderated regression plot from visreg
Now that we have the output and the interaction plot, let’s interpret what we found. In the presence of an interaction, the variables involved in the interaction on their own become simple effects. There was not a statistically significant interaction between stress and sex group. It is likely that if this interaction was not central in practice you would drop the interaction, re-run the model and interpret the results. However, to show what could be interpreted if there was a significant interaction, we will proceed. There was a significant simple effect of stress on neuroticism, such that for male participants (i.e., when SexDummy = 0), each one unit higher stress was associated with 0.44 higher neuroticism [95% CI .15 - .73], p = .003 and this was a small effect, \(f^2 = 0.07\). There was no simple effect of sex on neuroticism when stress was held at 0, (b = 2.73, [95% CI -0.44 - 5.90], p = .091) and the effect size was small, \(f^2 = .02\). The interaction was not significant, and a graph of the interaction showed that the pattern was that the association between stress and neuroticism was about the same in male and female participants. The overall model explained 37% of the variance in neuroticism in the sample.
It also is possible to have an interaction between two continuous variables. The R code is virtually identical, but we can no longer talk about “groups”. We will use an almost identical model, but with continuous self esteem instead of sex.
mint2 <- lm(neuroticism ~ stress * selfesteem, data = db)
knitr::kable(APAStyler(modelTest(mint2)))
| Term | Est | Type |
|---|---|---|
| (Intercept) | 8.61* [ 1.73, 15.48] | Fixed Effects |
| stress | -0.01 [-0.56, 0.54] | Fixed Effects |
| selfesteem | -0.36 [-0.79, 0.07] | Fixed Effects |
| stress:selfesteem | 0.03 [-0.01, 0.06] | Fixed Effects |
| N (Observations) | 141 | Overall Model |
| logLik DF | 5 | Overall Model |
| logLik | -289.78 | Overall Model |
| AIC | 589.55 | Overall Model |
| BIC | 604.30 | Overall Model |
| F2 | 0.35 | Overall Model |
| R2 | 0.26 | Overall Model |
| Adj R2 | 0.24 | Overall Model |
| stress | f2 = 0.00, p = .970 | Effect Sizes |
| selfesteem | f2 = 0.02, p = .099 | Effect Sizes |
| stress:selfesteem | f2 = 0.02, p = .153 | Effect Sizes |
We can plot the results with visreg(). Although self esteem is continuous a common way to give a sense of the interaction is to select a few specific values to plot lines at.
visreg(mint2, xvar = "stress", by = "selfesteem", partial = FALSE, rug = FALSE,
overlay = TRUE, gg = TRUE) +
theme_pubr() +
ggtitle("A moderated multiple regression of two continuous predictors")
moderated regression plot from visreg
Three lines with confidence intervals in one figure is difficult to read, so we could also use the breaks argument to specify specific self esteem scale points we want it to plot the interaction for. This makes it cleaner and easier to read and is shown in the following plot.
visreg(mint2, xvar = "stress", by = "selfesteem", breaks = c(5, 10),
partial = FALSE, rug = FALSE,
overlay = TRUE, gg = TRUE) +
theme_pubr() +
ggtitle("A moderated multiple regression of two continuous predictors")
moderated regression plot from visreg at user selected values of self esteem
Overall we can interpret that there is no statistically significant interaction between self esteem and stress on neuroticism (p = .153). The association between stress and neuroticism does not depend on self esteem.
Note that with continuous predictors, unless they were mean centered, the default simple effects are not very useful. For example we would say that the simple effects show that each one unit higher stress is associated with -0.01 lower neuroticism, when selfesteem = 0. Although a technically accurate interpretation, no one scored 0 on selfesteem in this dataset (because the way we totaled it is impossible) and so we probably do not want to extrapolate to self esteem = 0. This is because the default simple effects from regression models when there are interactions are holding the other variable at 0. If we mean center variables prior to testing interactions, then 0 is the mean, which gives more meaningful simple effect tests.
Here is a little summary of some of the functions used in this topic.
| Function | What it does |
|---|---|
lm() |
Fits a linear regression model |
modelDiagnostics() |
allows testing residuals for normality and for homogeneity of variance in models |
modelTest() |
tests results of model and includes effect sizes for each predictor |
visreg() |
can be used to make plots of regression models, with or without interactions |